Statistics 2

1st Project Assignment

Selecting Data Set

Description of our data set:

An individual’s annual income results from various factors.

Columns: The dataset contains 16 columns

We decided to drop two columns: capital-gain and capital-loss. While investigating the data, we saw that the histograms of these variables are very concentrated (most records have the same value). Because almost all records share a single value, these variables vary very little across records, so they do not help explain the data and we cannot draw new conclusions from them. Graphs are provided later on.

Data set's number of records: 48842

Link to the data set: https://www.kaggle.com/wenruliu/adult-income-dataset?select=adult.csv

Initial Analysis of the Data

Imports

Numeric features analysis

Dropped Features Analysis

As the graphs show, the values of these variables are heavily concentrated, and most of them are zero. In both variables, ~98% of the records are zero. Therefore, as we explained in the description of our data set, we dropped these features from our data.
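The share of zeros quoted above can be checked directly on a column. The following is a minimal sketch using only the Python standard library; the helper name `zero_share` and the toy column are assumptions made for illustration and are not taken from our notebook or from the real data.

```python
def zero_share(values):
    """Return the fraction of entries that are exactly zero."""
    values = list(values)
    if not values:
        raise ValueError("values must be non-empty")
    return sum(1 for v in values if v == 0) / len(values)


# Toy column standing in for capital-gain (made-up numbers, NOT the real data).
toy_capital_gain = [0, 0, 0, 0, 0, 0, 0, 0, 7688, 0]
print(f"share of zeros: {zero_share(toy_capital_gain):.0%}")
```

On the real data set the same function would be applied to the capital-gain and capital-loss columns after loading adult.csv.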

Research Questions

A. Regression between two continuous variables:

Does an increase in years of education (educational-num) cause a decrease in the number of working hours per week?
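One way to approach question A is a simple least-squares line of hours per week on years of education. The sketch below is only an illustration under stated assumptions: the function name `simple_ols` and the toy numbers are hypothetical, and it is not necessarily the procedure used later in this project. Note that the fitted slope describes an association between the two variables; on its own it does not establish cause.

```python
from statistics import mean


def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy values standing in for educational-num and hours-per-week (made up).
toy_education = [9, 10, 12, 13, 14, 16]
toy_hours = [40, 38, 45, 40, 50, 45]
intercept, slope = simple_ols(toy_education, toy_hours)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

A negative slope would be consistent with fewer working hours at higher education levels, and a positive slope with the opposite.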

B. Regression between a continuous variable and a binary variable:

Does an increase in number of working hours per week cause a decrease in the probability of making more than 50K dollars per year?
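A common model for a binary outcome with a continuous predictor is logistic regression, where the sign of the slope indicates whether the probability rises or falls with the predictor. The following is a minimal sketch assuming that model, fitted by plain gradient ascent with the standard library only. The names `sigmoid` and `fit_logistic`, the step size, the iteration count and the toy data are all assumptions for illustration, not the method or results of this project.

```python
import math
from statistics import mean, pstdev


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(x, y, lr=0.5, n_iter=2000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent.

    x is standardized internally so that a fixed step size is stable;
    the returned (b0, b1) are on the original scale of x.
    """
    mu, sd = mean(x), pstdev(x)
    if sd == 0:
        raise ValueError("x must not be constant")
    z = [(xi - mu) / sd for xi in x]
    n = len(z)
    c0 = c1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for zi, yi in zip(z, y):
            err = yi - sigmoid(c0 + c1 * zi)  # gradient of the log-likelihood
            g0 += err
            g1 += err * zi
        c0 += lr * g0 / n
        c1 += lr * g1 / n
    # Convert the coefficients back to the original scale of x.
    return c0 - c1 * mu / sd, c1 / sd


# Toy values: hours-per-week and an indicator for income above 50K (made up).
toy_hours = [20, 30, 35, 40, 40, 45, 50, 55, 60, 60]
toy_high_income = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
b0, b1 = fit_logistic(toy_hours, toy_high_income)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

A negative slope would correspond to a lower probability of earning more than 50K as weekly hours increase; a positive slope would point the other way.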

C. Test question:

Is the distribution of years of education different between genders?
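One possible way to test question C without distributional assumptions is a two-sided permutation test on the difference in group means. The sketch below is an assumption about how such a test could look, not the test chosen in this project; the function name `permutation_test`, the number of permutations, the seed and the toy values are hypothetical. It only detects a difference in means, so other differences in distribution would need a different statistic.

```python
import random


def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means; returns a p-value."""
    if not a or not b:
        raise ValueError("both groups must be non-empty")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n_a, n_b = len(a), len(b)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the group membership
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy values standing in for educational-num by gender (made up).
toy_group_1 = [9, 10, 10, 12, 13, 13, 14, 16]
toy_group_2 = [9, 9, 10, 11, 12, 13, 13, 14]
print(f"p-value = {permutation_test(toy_group_1, toy_group_2):.3f}")
```

A small p-value would suggest that the mean years of education differ between the two groups.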